Multilingual Offensive Language Identification for Low-resource Languages

Abstract

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most annotated datasets available contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of the approach.

Similar resources

Multilingual Neural Machine Translation for Low Resource Languages

Neural Machine Translation (NMT) has been shown to be more effective in translation tasks compared to the Phrase-Based Statistical Machine Translation (PBMT). However, NMT systems are limited in translating low-resource languages (LRL), due to the fact that neural methods require a large amount of parallel data to learn effective mappings between languages. In this work we show how so-called mu...

Multilingual features based keyword search for very low-resource languages

In this paper we describe RWTH Aachen’s system for keyword search (KWS) with a very limited amount of transcribed audio data available in the target language. This setting has become this year’s primary condition within the Babel project [1], seeking to minimize the amount of human effort while retaining a reasonable KWS performance. Thus the highlights presented in this paper include graphemic a...

Multilingual Projection for Parsing Truly Low-Resource Languages

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our...

Neural network language models for low resource languages

For resource-rich languages, recent works have shown Neural Network based Language Models (NNLMs) to be an effective modeling technique for Automatic Speech Recognition, outperforming standard n-gram language models (LMs). For low-resource languages, however, the performance of NNLMs has not been well explored. In this paper, we evaluate the effectiveness of NNLMs for low-resource languages an...

Multilingual native language identification

We present the first study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author’s first language (L1) using only their writings in a second language (L2), with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but the...

Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2021

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3457610